HBO Universe

Lights, camera, action!

For our final project, we're going to take a deep dive into the world of HBO movies and TV shows. HBO has been providing quality content to its viewers for decades, but have you ever been stuck on what to watch? By analyzing ratings, popularity, and your favorite actors and actresses, let's take a look at what HBO can offer its loyal fans.

What will we be looking into?

  • Movies and shows on HBO

  • What does the distribution of Genres look like for both movies and shows?

  • Correlations between age_certification and genre?

  • Actors and Directors involved with HBO

  • Actors and Directors associated with highest-rated titles

  • Distribution of Number of Movies and Shows by Release Year

  • Top 10 most popular Movies and Shows

  • Correlation between the runtime of movies and their popularity?

  • Correlation between Genre and Popularity?

  • Relationship between the number of movies and shows and the countries that produced them?

Installing packages

# install.packages("rmarkdown")

# load tidyverse to manipulate data
# load ggplot2 for graphing
# load shiny for interactive output
# load dplyr to manipulate data
# load countrycode to convert country codes into country names
# load knitr for general-purpose literate programming
# load kableExtra to add features to tables
# load maps for the map graph

library(rmarkdown)  ## used for our output theme 'readthedown'
#render("mydocument.Rmd")
library(tidyverse)
library(ggplot2)
library(shiny)
library(dplyr)
library(countrycode)
library(knitr)
library(kableExtra)
library(maps)

About Our Data

The dataset we have chosen to work with is sourced from Kaggle and is the property of Diego Enrique. Here is the link to access it: https://www.kaggle.com/datasets/dgoenrique/hbo-max-movies-and-tv-shows

Additionally, we would like to credit the rmdformats library, created by Julien Barnier, which provides the 'readthedown' output theme used in our project. Here's the link to the theme: https://mran.microsoft.com/snapshot/2019-12-15/web/packages/rmdformats/vignettes/introduction.html

Our data consists of two CSV files:

Titles data:

15 variables, 3030 observations

id: The title ID

title: The name of the title

type: TV show (SHOW) or movie (MOVIE)

description: A description of movie or tv show

release_year: Year show/movie was released

age_certification: The age rating of movie or show

runtime: The length of the movie, or of one episode of the show, in minutes

genres: A list of genres

production_countries: Countries that produced the show/movie

seasons: Number of seasons IF it is a show

imdb_id: The title ID on IMDB

imdb_score: Score on IMDB

imdb_votes: Votes on IMDB

tmdb_popularity: Popularity on TMDB

tmdb_score: Score on TMDB

Credits data:

5 variables, 64879 observations

person_id: The person ID on JustWatch

id: The title ID on JustWatch

name: The name of actor or director

character: The name of the character played in the movie/show

role: ACTOR or DIRECTOR

Let us read our data, shall we?

We're using the kable and head functions to show the first few rows of each data set in an organized table.

Here’s our credits.csv

Sample table of credits data

| person_id | id | name | character | role |
|---|---|---|---|---|
| 14701 | tm77588 | Humphrey Bogart | Rick Blaine | ACTOR |
| 14702 | tm77588 | Ingrid Bergman | Ilsa Lund | ACTOR |
| 14703 | tm77588 | Paul Henreid | Victor Laszlo | ACTOR |
| 14704 | tm77588 | Claude Rains | Captain Louis Renault | ACTOR |
| 14705 | tm77588 | Conrad Veidt | Major Heinrich Strasser | ACTOR |
| 14706 | tm77588 | Sydney Greenstreet | Signor Ferrari | ACTOR |

And here’s our titles.csv

Sample table of titles data

| id | title | type | release_year | age_certification | runtime | genres | production_countries | seasons | imdb_id | imdb_score | imdb_votes | tmdb_popularity | tmdb_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tm77588 | Casablanca | MOVIE | 1943 | PG | 102 | [‘drama’, ‘romance’, ‘war’] | [‘US’] | NA | tt0034583 | 8.5 | 577842 | 22.005 | 8.167 |
| tm155702 | The Wizard of Oz | MOVIE | 1939 | G | 102 | [‘fantasy’, ‘family’] | [‘US’] | NA | tt0032138 | 8.1 | 406105 | 56.631 | 7.583 |
| tm83648 | Citizen Kane | MOVIE | 1941 | PG | 119 | [‘drama’] | [‘US’] | NA | tt0033467 | 8.3 | 446627 | 19.900 | 8.022 |
| tm3175 | Meet Me in St. Louis | MOVIE | 1945 |  | 113 | [‘drama’, ‘family’, ‘romance’, ‘music’, ‘comedy’] | [‘US’] | NA | tt0037059 | 7.5 | 25589 | 8.311 | 7.000 |
| ts225761 | Tom and Jerry | SHOW | 1940 |  | 8 | [‘animation’, ‘comedy’, ‘family’, ‘action’] | [‘US’] | 16 | tt6422744 | 7.7 | 859 | 1.400 | 10.000 |
| tm156463 | Gone with the Wind | MOVIE | 1940 | G | 238 | [‘drama’, ‘romance’, ‘war’, ‘history’] | [‘US’] | NA | tt0031381 | 8.2 | 319463 | 27.535 | 8.000 |

What will it look like if we try to combine these data sets?

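# Attach every credit to its title; a title with several credits yields several rows
both_data <- left_join(titles, credits, by = "id")

# Show the first rows as a centered, bordered HTML table
kable(head(both_data),
      align = "c",
      caption = "<center><b>Sample table of both data</b></center>",
      format = "html") %>% 
  kable_styling(bootstrap_options = "bordered", full_width = FALSE)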
Sample table of both data
id title type release_year age_certification runtime genres production_countries seasons imdb_id imdb_score imdb_votes tmdb_popularity tmdb_score person_id name character role
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167 14701 Humphrey Bogart Rick Blaine ACTOR
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167 14702 Ingrid Bergman Ilsa Lund ACTOR
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167 14703 Paul Henreid Victor Laszlo ACTOR
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167 14704 Claude Rains Captain Louis Renault ACTOR
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167 14705 Conrad Veidt Major Heinrich Strasser ACTOR
tm77588 Casablanca MOVIE 1943 PG 102 [‘drama’, ‘romance’, ‘war’] [‘US’] NA tt0034583 8.5 577842 22.005 8.167 14706 Sydney Greenstreet Signor Ferrari ACTOR

Because we have multiple people credited per title, the joined table repeats each title's information once for every person.

Let's begin by determining the number of movies and TV shows we are working with

Type Count
MOVIE 2408
SHOW 622
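A count table like the one above can be sketched with dplyr's count(). The data frame titles_demo below is a small made-up stand-in for the titles data, not the real rows, so the numbers will differ from the table.

```r
library(dplyr)

# Toy stand-in for the titles data (illustrative values only)
titles_demo <- data.frame(
  id   = c("tm1", "tm2", "tm3", "ts1"),
  type = c("MOVIE", "MOVIE", "MOVIE", "SHOW")
)

# One row per type, with the number of titles of that type
type_counts <- titles_demo %>%
  count(type, name = "Count")

print(type_counts)
```

Passing type_counts to kable() would give a formatted table in the same style as the other tables in this report.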

Wow! That's a lot more movies than shows! But let's see it visually

What’s the distribution of genres for both Shows and Movies in our dataset?
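Because the genres column stores a whole list in one text cell, counting genres first requires one row per title and genre. The sketch below shows one possible way to do that with tidyr's separate_rows(). It assumes the raw text looks like ['drama', 'romance'] with straight quotes, and titles_demo is made-up data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the titles data (illustrative values only)
titles_demo <- data.frame(
  title  = c("A", "B", "C"),
  type   = c("MOVIE", "MOVIE", "SHOW"),
  genres = c("['drama', 'romance']", "['drama']", "['comedy', 'drama']")
)

genre_counts <- titles_demo %>%
  mutate(genres = gsub("\\[|\\]|'", "", genres)) %>%  # strip brackets and quotes
  separate_rows(genres, sep = ",\\s*") %>%            # one row per title-genre pair
  filter(genres != "") %>%                            # drop titles with an empty list
  count(type, genres, name = "n_titles")              # titles per type and genre

print(genre_counts)
```

A title with several genres is counted once under each of them, so the genre counts add up to more than the number of titles.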

Now, let us see if there's a correlation between age_certification and genres

## Unique age certifications:  PG, G, PG-13, R, TV-G, TV-Y, TV-Y7, TV-PG, NC-17, TV-14, TV-MA, TV-Y7-FV

Well, that did not work as expected. Let's see if a geom_tile graph does the job:
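A geom_tile heatmap needs one count per combination of age certification and genre. The following is only a sketch of that idea on made-up pairs (pairs_demo is hypothetical and assumes the genres have already been split into one row per title and genre).

```r
library(dplyr)
library(ggplot2)

# Toy data: one row per title-genre pair (illustrative values only)
pairs_demo <- data.frame(
  age_certification = c("PG", "PG", "R", "R", "R", "TV-MA"),
  genre             = c("drama", "family", "drama", "drama", "horror", "drama")
)

# Number of titles for each certification-genre combination
tile_counts <- pairs_demo %>%
  count(age_certification, genre, name = "n_titles")

# One tile per combination, shaded by the number of titles
heatmap_plot <- ggplot(tile_counts,
                       aes(x = genre, y = age_certification, fill = n_titles)) +
  geom_tile(color = "white") +
  labs(x = "Genre", y = "Age certification", fill = "Titles")

print(heatmap_plot)
```

Combinations that never occur simply have no tile, which makes gaps in the grid easy to spot.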

Now let's look at the actors and directors in our credits data.

role n
ACTOR 62158
DIRECTOR 2721

Since actors and directors can have multiple projects, let's remove the duplicates

Unique names for actors (unique_names_of_actors): 43930

Unique names for directors (unique_names_of_directors): 1730
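Counting unique names per role can be sketched with n_distinct(). The credits_demo data frame below is made up for illustration. Counting by name is an assumption based on the labels above; counting by person_id instead would keep two different people who share a name apart.

```r
library(dplyr)

# Toy stand-in for the credits data (illustrative values only)
credits_demo <- data.frame(
  person_id = c(1, 1, 2, 3, 3),
  id        = c("tm1", "tm2", "tm1", "tm2", "tm3"),
  name      = c("Ann Lee", "Ann Lee", "Bo Chen", "Cy Diaz", "Cy Diaz"),
  role      = c("ACTOR", "ACTOR", "ACTOR", "DIRECTOR", "DIRECTOR")
)

# Number of distinct names within each role
unique_people <- credits_demo %>%
  group_by(role) %>%
  summarise(unique_names = n_distinct(name), .groups = "drop")

print(unique_people)
```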

Are any of these actors/directors in multiple projects? If so, who was in the most projects?

| person_id | name | role | total_projects |
|---|---|---|---|
| 14142 | Grey DeLisle | ACTOR | 60 |
| 529 | Frank Welker | ACTOR | 50 |
| 20372 | Tara Strong | ACTOR | 42 |
| 7997 | Kevin Michael Richardson | ACTOR | 36 |
| 6821 | Fred Tatasciore | ACTOR | 35 |
| 18723 | Dee Bradley Baker | ACTOR | 35 |

| person_id | name | role | total_projects |
|---|---|---|---|
| 21759 | Charlie Chaplin | DIRECTOR | 22 |
| 27098 | Sam Liu | DIRECTOR | 17 |
| 69510 | Jon Alpert | DIRECTOR | 15 |
| 106013 | Yasujirō Ozu | DIRECTOR | 13 |
| 210814 | Satyajit Ray | DIRECTOR | 13 |
| 192306 | Alexandra Pelosi | DIRECTOR | 11 |
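One way to arrive at a total_projects column is to count the distinct title IDs for each person and role. This is a sketch on made-up data (credits_demo is hypothetical); using n_distinct(id) rather than n() is a design choice so that a person who plays two characters in the same title is still counted once for it.

```r
library(dplyr)

# Toy stand-in for the credits data (illustrative values only)
credits_demo <- data.frame(
  person_id = c(1, 1, 1, 1, 2, 3, 3),
  id        = c("tm1", "tm1", "tm2", "tm3", "tm1", "tm2", "tm3"),
  name      = c("Ann Lee", "Ann Lee", "Ann Lee", "Ann Lee",
                "Bo Chen", "Cy Diaz", "Cy Diaz"),
  role      = c("ACTOR", "ACTOR", "ACTOR", "ACTOR",
                "ACTOR", "DIRECTOR", "DIRECTOR")
)

# Distinct titles per person and role, most prolific first
top_people <- credits_demo %>%
  group_by(person_id, name, role) %>%
  summarise(total_projects = n_distinct(id), .groups = "drop") %>%
  arrange(desc(total_projects))

# Separate rankings for actors and directors
top_actors    <- top_people %>% filter(role == "ACTOR") %>% head()
top_directors <- top_people %>% filter(role == "DIRECTOR") %>% head()

print(top_actors)
print(top_directors)
```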

Which actors and directors are associated with the highest-rated titles, based on the average of each title's IMDB and TMDB scores?

Actors/Directors with the highest-rated titles

| person_id | name | role | title | average_score | imdb_score | tmdb_score |
|---|---|---|---|---|---|---|
| 6653 | Carlos Alazraqui | ACTOR | Crashbox | 9.2500 | 8.5 | 10.000 |
| 2990 | Ned Bellamy | ACTOR | The Shawshank Redemption | 9.0010 | 9.3 | 8.702 |
| 237364 | Jessie Buckley | ACTOR | Chernobyl | 9.0000 | 9.4 | 8.600 |
| 775128 | JC Lin | ACTOR | The World Between Us | 8.9610 | 8.9 | 9.022 |
| 29657 | Dennis McNicholas | DIRECTOR | Batman: The Audio Adventures | 8.9500 | 8.4 | 9.500 |
| 764152 | Bella Ramsey | ACTOR | The Last of Us | 8.9490 | 9.1 | 8.798 |
| 24613 | Damian Lewis | ACTOR | Band of Brothers | 8.9270 | 9.4 | 8.454 |
| 91103 | Justin Roiland | ACTOR | Rick and Morty | 8.9145 | 9.1 | 8.729 |
| 9439 | Lorraine Bracco | ACTOR | The Sopranos | 8.9015 | 9.2 | 8.603 |
| 4252 | Clarke Peters | ACTOR | The Wire | 8.9005 | 9.3 | 8.501 |
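The average_score column is the plain mean of the two scores. For Crashbox, for example, (8.5 + 10.000) / 2 = 9.25. A minimal sketch of that calculation, using three score pairs copied from the table above (the data frame name scores_demo is made up):

```r
library(dplyr)

# Three rows copied from the table above
scores_demo <- data.frame(
  title      = c("Crashbox", "The Shawshank Redemption", "Chernobyl"),
  imdb_score = c(8.5, 9.3, 9.4),
  tmdb_score = c(10.000, 8.702, 8.600)
)

# Mean of the two scores, highest average first
ranked <- scores_demo %>%
  mutate(average_score = (imdb_score + tmdb_score) / 2) %>%
  arrange(desc(average_score))

print(ranked)
```

Note that a simple mean gives both sites equal weight, even when one score rests on far fewer votes than the other.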

Here is the distribution of shows and movies available on HBO by release year

Here’s the summary table of what the graph shows

Number of Shows and Movies Available by Year
decade MOVIE SHOW
1900s 8 NA
1910s 12 NA
1920s 35 1
1930s 44 NA
1940s 57 1
1950s 91 NA
1960s 130 6
1970s 109 3
1980s 170 6
1990s 265 43
2000s 395 94
2010s 706 230
2020s 274 148
NA 112 90
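A decade table of this shape can be sketched by rounding each release year down to its decade and spreading the counts into one column per type. This is an assumption about the approach, shown on made-up data (titles_demo), and it does not reproduce the NA decade row above.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the titles data (illustrative values only)
titles_demo <- data.frame(
  type         = c("MOVIE", "MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW"),
  release_year = c(1901, 1943, 1949, 1940, 2015, 2019)
)

by_decade <- titles_demo %>%
  mutate(decade = paste0(floor(release_year / 10) * 10, "s")) %>%  # 1943 -> "1940s"
  count(decade, type) %>%                                          # titles per decade and type
  pivot_wider(names_from = type, values_from = n)                  # one column per type

print(by_decade)
```

In this sketch, a decade with movies but no shows gets NA in the SHOW column, because pivot_wider() leaves missing combinations empty unless values_fill is set.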

This tells us that HBO primarily features movies and shows from the 2010s.

The titles also span a wide range of release years.

I wonder what the oldest and newest movies and shows are?

Oldest Movie on HBO
title release_year genres
The Prince of Magicians 1901 [‘comedy’]
Oldest Show on HBO
title release_year genres
Looney Tunes 1929 [‘comedy’, ‘family’, ‘thriller’, ‘animation’]
Newest Movie on HBO
title release_year genres
Marc Maron: From Bleak to Dark 2023 [‘comedy’, ‘documentation’]
Newest Show on HBO
title release_year genres
The Last of Us 2023 [‘drama’, ‘action’, ‘horror’, ‘scifi’, ‘thriller’]
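Lookups like these can be sketched with dplyr's slice_min() and slice_max(), grouped by type. The rows below reuse titles and years that appear in this report; the data frame name titles_demo is made up, and with_ties = FALSE simply keeps one row when several titles share a year.

```r
library(dplyr)

# A few titles and release years taken from the tables in this report
titles_demo <- data.frame(
  title = c("The Prince of Magicians", "Casablanca",
            "Marc Maron: From Bleak to Dark", "Looney Tunes", "The Last of Us"),
  type = c("MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"),
  release_year = c(1901, 1943, 2023, 1929, 2023)
)

# Earliest release year within each type
oldest <- titles_demo %>%
  group_by(type) %>%
  slice_min(release_year, n = 1, with_ties = FALSE) %>%
  ungroup()

# Latest release year within each type
newest <- titles_demo %>%
  group_by(type) %>%
  slice_max(release_year, n = 1, with_ties = FALSE) %>%
  ungroup()

print(oldest)
print(newest)
```

Swapping release_year for runtime gives the shortest and longest titles in the same way.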

Now let's explore whether there is a relationship between a movie's runtime and its popularity.
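A relationship like this can be put into a single number with a correlation coefficient. The sketch below uses only the five movies from the sample titles table, so the result says nothing about the full data set; it just shows the calls. The data frame name movies_demo is made up.

```r
# Runtime and TMDB popularity of the five movies in the sample titles table
movies_demo <- data.frame(
  title = c("Casablanca", "The Wizard of Oz", "Citizen Kane",
            "Meet Me in St. Louis", "Gone with the Wind"),
  runtime = c(102, 102, 119, 113, 238),
  tmdb_popularity = c(22.005, 56.631, 19.900, 8.311, 27.535)
)

# Pearson correlation: strength of the linear relationship (between -1 and 1);
# use = "complete.obs" skips rows where either value is missing
pearson_r <- cor(movies_demo$runtime, movies_demo$tmdb_popularity,
                 use = "complete.obs")

# Spearman correlation: rank-based, less sensitive to extreme popularity values
spearman_r <- cor(movies_demo$runtime, movies_demo$tmdb_popularity,
                  use = "complete.obs", method = "spearman")

print(c(pearson = pearson_r, spearman = spearman_r))
```

A value near 0 would suggest that runtime tells us little about popularity, while values near 1 or -1 would suggest a strong relationship.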

Since we're looking at runtimes, let's see what HBO's shortest and longest movies and shows are

Shortest movie on HBO
title runtime release_year genres
An Impossible Balancing Feat 1 1902 []
Longest movie on HBO
title runtime release_year genres
Scenes from a Marriage 299 1974 [‘drama’, ‘european’]
Shortest Show on HBO
title runtime seasons release_year genres
Meet the Batwheels 2 1 2022 [‘animation’, ‘action’]
Longest Show on HBO
title runtime seasons release_year genres
Sesame Street 51 53 1969 [‘comedy’, ‘animation’, ‘family’, ‘fantasy’, ‘music’]

Last but not least, let us look at the Number of movies and TV shows by country

Unfortunately, because HBO's movies and shows come from only 99 countries, some countries on the map are left uncolored

Here’s the summary table of what the map shows

Number of movies and TV shows by country
production_countries full_country_name MOVIE SHOW
US USA 1824 491
GB United Kingdom 270 38
FR France 178 5
JP Japan 112 7
DE Germany 87 5
CA Canada 75 7
IT Italy 54 4
ES Spain 28 19
MX Mexico 26 6
AU Australia 24 1
SE Sweden 22 NA
IN India 20 1
CH Switzerland 16 NA
HK Hong Kong SAR China 14 NA
CN China 13 1
DK Denmark 13 1
BE Belgium 12 NA
BR Brazil 4 12
NZ New Zealand 12 1
PL Poland 11 2
SU NA 11 NA
ZA South Africa 10 NA
AR Argentina 8 6
AT Austria 8 NA
IE Ireland 8 NA
NL Netherlands 7 NA
AE United Arab Emirates 6 NA
PR Puerto Rico 6 NA
SG Singapore 1 6
TW Taiwan NA 6
BG Bulgaria 5 NA
CO Colombia 5 NA
IL Israel 5 5
SN Senegal 5 NA
CZ Czechia 4 2
DO Dominican Republic 4 NA
GR Greece 4 NA
KR South Korea 4 NA
XC NA 4 NA
BO Bolivia 3 NA
CL Chile 3 3
CU Cuba 3 NA
EC Ecuador 3 NA
HU Hungary 3 1
IS Iceland 3 NA
NO Norway 3 NA
PT Portugal 3 NA
UY Uruguay 3 1
DZ Algeria 2 NA
FI Finland 2 NA
ID Indonesia NA 2
IR Iran 2 NA
LU Luxembourg 2 NA
NG Nigeria 2 NA
PE Peru 2 NA
PK Pakistan 2 NA
RO Romania 2 2
RU Russia 2 1
TH Thailand 2 NA
TR Turkey 2 NA
AF Afghanistan 1 NA
BS Bahamas 1 NA
EG Egypt 1 NA
GT Guatemala 1 NA
KH Cambodia 1 NA
MA Morocco 1 NA
MC Monaco 1 NA
MK North Macedonia 1 NA
PA Panama 1 NA
PH Philippines 1 1
PY Paraguay 1 NA
RW Rwanda 1 NA
UA Ukraine 1 NA
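A country table of this shape can be sketched the same way as the genre counts: split the production_countries list into one row per title and country, count, and spread the counts by type. The data below is made up (titles_demo), and the text format ['US', 'GB'] with straight quotes is an assumption about the raw file.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the titles data (illustrative values only)
titles_demo <- data.frame(
  type = c("MOVIE", "MOVIE", "SHOW", "SHOW"),
  production_countries = c("['US']", "['US', 'GB']", "['US']", "['TW']")
)

by_country <- titles_demo %>%
  mutate(production_countries = gsub("\\[|\\]|'", "", production_countries)) %>%
  separate_rows(production_countries, sep = ",\\s*") %>%  # one row per title-country pair
  filter(production_countries != "") %>%                  # drop titles with no country listed
  count(production_countries, type) %>%                   # titles per country and type
  pivot_wider(names_from = type, values_from = n)         # one column per type

print(by_country)
```

In this sketch a co-production is counted once for each country involved, and a country with only movies or only shows gets NA in the other column.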

Final Summary

Hopefully, by the end of this, we have given you some tools to help when you can't figure out what to watch. If you are looking for more variety, you will probably want to stick with movies, since there are far more of them to choose from (2408 movies versus 622 shows). The same goes if you want something newer. Based on our distribution of movies and shows by release year, you will have the most options from the 2000s and 2010s. If you want to stick with a specific genre, our genre and popularity graph will help with that; for example, it shows that drama is the most popular genre.